Integrating Ngram Model and Case-based Learning for Chinese Word Segmentation
Authors
Abstract
This paper presents our recent work for participation in the First International Chinese Word Segmentation Bakeoff (ICWSB-1). Our system is based on a general-purpose ngram model for word segmentation and a case-based learning approach to disambiguation. It excels at identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model training and disambiguation rule learning, analyze the system's performance, and discuss areas for further improvement, e.g., out-of-vocabulary (OOV) word discovery.
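To make the idea of ngram-based segmentation concrete, the following is a minimal sketch of the simplest case (a unigram model, n = 1) decoded with dynamic programming. It is not the system described in the abstract: the function name `segment`, the `max_word_len` limit, the floor probability for unseen single characters, and the toy counts in the usage note are all assumptions made for illustration, and the case-based disambiguation step is not shown.

```python
import math


def segment(text, unigram_counts, max_word_len=4):
    """Return the most probable word sequence for `text` under a unigram model.

    `unigram_counts` maps known words to their training counts (hypothetical
    data). Unseen single characters receive a small floor probability so that
    every input has at least one valid segmentation.
    """
    total = sum(unigram_counts.values())

    def log_prob(word):
        count = unigram_counts.get(word)
        if count is not None:
            return math.log(count / total)
        if len(word) == 1:
            # Assumed floor for unknown characters; not a trained estimate.
            return math.log(0.5 / total)
        return None  # unknown multi-character strings are not candidates

    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word in that path
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            lp = log_prob(text[start:end])
            if lp is None:
                continue
            score = best[start] + lp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With toy counts such as `{"研究": 10, "生命": 8, "研究生": 3, "命": 1, "的": 20, "起源": 5}`, calling `segment("研究生命的起源", counts)` prefers 研究 / 生命 over 研究生 / 命, because the product of the word probabilities is higher. A higher-order ngram model would condition each word on its predecessors instead of scoring words independently.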
Similar resources
Modeling of Long Distance Context Dependency in Chinese
Ngram modeling is simple in language modeling and has been widely used in many applications. However, it can only capture the short distance context dependency within an N-word window where the largest practical N for natural language is three. In the meantime, much of context dependency in natural language occurs beyond a three-word window. In order to incorporate this kind of long distance co...
An Improved CRF based Chinese Language Processing System for SIGHAN Bakeoff 2007
This paper describes three systems: the Chinese word segmentation (WS) system, the named entity recognition (NER) system and the Part-of-Speech tagging (POS) system, which are submitted to the Fourth International Chinese Language Processing Bakeoff. Here, Conditional Random Fields (CRFs) are employed as the primary models. For the WS and NER tracks, the ngram language model is incorporated in ...
A Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid model that combines machine learning with linguistic and statistical heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two major components: a tagging component that annotates each character in a Chinese sentence with a position-of-character (POC) tag that indicates its position in a word, and a merging com...
Linguistic tuple segmentation in n-gram-based statistical machine translation
Ngram-based Statistical Machine Translation relies on a standard Ngram language model of tuples to estimate the translation process. In training, this translation model requires a segmentation of each parallel sentence, which involves taking a hard decision on tuple segmentation when a word is not linked during word alignment. This is especially critical when this word appears in the target lan...
Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation
This paper describes a hybrid model that combines machine learning with linguistic heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component that annotates each character in a sentence with a POC tag that indicates its position in a word, and a merging component that transforms a P...
Journal title:
- Journal of Chinese Language and Computing
Volume 14
Pages -
Publication date: 2003